Effects of COVID-19 pandemic on hospital-acquired infections
BMIN503/EPID600 Final Project
Author
Kevin Mears
1 Overview
The goal of my final project is to determine how the COVID-19 pandemic and the surge in hospital visits, changes in cleaning/hygiene, and people staying at home affected the incidence of hospital-associated infections. I identified the 2020 data set from the CDC National Hospital Care Survey (NHCS) which contains 132,694 inpatients and 388,753 emergency department visits. It contains information including patient age, sex, the month they were discharged, and their diagnoses. I will use this data to determine how the incidence of hospital-associated infections changed month-to-month in 2020 and whether there are correlations with other parameters (e.g. COVID-19 diagnosis).
I spoke with Dr. Joseph Zackular (Assistant Professor of Pathology and Laboratory Medicine at CHOP) who is a bacteriologist and an expert on C. difficile, the most common hospital-acquired infection. He suggested I stratify the dataset based on sex and age as demographics that might affect hospital infections. He suggested I look at bacterial pneumonia as a positive control as it is commonly associated with COVID co-infection.
I also spoke with Dr. Kyle Bittinger (Bioinformatics Laboratory Director, CHOP Microbiome Center) who gave suggestions on the code and analysis. For example, he suggested I adjust for other variables (sex, age, etc.) for analyzing the association between COVID-19 and secondary infections. He also suggested I expand on the change over time in age proportions in the bacterial pneumonia as that is the most interesting finding.
Hospital-acquired infections (HAI) are infections acquired after hospital admission. It includes catheter-associated urinary tract infections, central line-associated blood stream infections, surgical site infections, ventilator-associated and hospital-acquired pneumonia, and Clostridioides difficile infections. The risk of HAIs depend on many factors, including the facility’s disinfection and infection prevention practices, the patient’s immune status, length of stay in the hospital, co-morbitities, ventilator support, and use of invasive procedures. Receipt of antibiotics is one of the major risk factors for developing Clostridioides difficile infection and other multi-drug resistant bacterial infections (e.g. Vancomycin resistant Enterococcus or Methicillin resistant Staphylococcus aureus). According to the CDC, about 4% of hospitalized patients experience at least one HAI, with an estimated 648,000 cases in 2011; these are dominated by pneumonia, surgical site infections, gastrointestinal infections, UTIs, and bloodstream infections. Clostidioides difficile infection is the most common cause of hospital acquired infections and can cause severe, potentially life-threatening colitis. Common causes of hospital-associated and ventilator-associated pneumonia are Staphylococcus aureus, Pseudomonas aeruginosa, E. coli, and Klebsiella penumoniae. Common causes of catheter-associated UTIs are Enterococcus, Staphylcoccus aureus, Pseudomonas, Proteus, Klebsiella, and Candida. Common causes of surgical site infections include Staphylococcus aureus, other Staphylococcus, Enterococcus, E. coli, Pseudomonas aeruginosa, Enterobacter, and Klebsiella.
The COVID-19 pandemic led to a surge in hospital visits, overwhelming our healthcare system and led to over a million deaths. Coincidentally, disinfection practices intensified in public facilities including hospitals. Many causes of hospital-associated infections, including C. difficile, MRSA, VRE, norovirus are difficult to disinfect using traditional cleaning methods. Additionally, antibiotics exposure increased during the pandemic. The present study aims to determine how the COVID-19 pandemic affected the incidence of hospital-associated infections in 2020.
This is an interdisciplinary question because it requires an understanding of microbiology and the various factors that affect pathogenesis, epidemiology to appreciate how disease is spread in a hospital setting, and data science for bioinformatics analysis of hospital survey data.
Sources:
Monegro et al. Hospital-Acquired Infections. (2023). In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK441857/
Dewey et al. Increased Use of Disinfectants During the COVID-19 Pandemic and Its Potential Impacts on Health and Safety. (2021) Acs Chem Health Safety.
Dancer, S.J. Controlling Hospital-Acquired Infection: Focus on the Role of the Environment and New Technologies for Decontamination. (2014) Clin Microbiol Rev.
3 Methods
I used the 2020 data set from the CDC National Hospital Care Survey (NHCS) to answer my question. I first imported the data set and determined the variables available. The useful variables in the data were the age, sex, length of stay, discharge month, and diagnosis.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggplot2")library("cowplot")
Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':
stamp
library("RColorBrewer")library("forcats") # for fct_relevellibrary("tidymodels")
# Read in NHCS 2020 inpatient Public Use File R Dataset#https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHCS/2020/R/nhcs2020ip_r.rdsurl <-"https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHCS/2020/R/nhcs2020ip_r.rds"nhcs2020ip <-read_rds(url)
# inpatient data set#pull out variables listvar_2020 <- nhcs2020ip$variables#select useful columnsvar_2020_select <-select(var_2020, 1:71)#variable names varnames_2020 <-colnames(var_2020_select)#pull out primary diagnosis (DX1)diag1_2020 <- var_2020_select$DX1#determine the number of primary C. diff infections (ICD-10 diagnosis code: A047)primaryCdiff_2020 <-sum(diag1_2020 =="A047", na.rm =TRUE)
I next determined the proportion of patients that stayed only for 1 day.
# count the number of patients that stayed for only 1 dayvar_2020_LOS_1 <- var_2020_select |>filter(LOS ==1) |>summarize(count =n()) |>pull(count)ip_total_count <- var_2020_select |>group_by(LOS) |>summarize(count =n()) |>summarize(sum(count))prop_ip_routine <- var_2020_LOS_1 / ip_total_countprop_ip_routine
I assigned the description of the diagnosis (ICD-10 code description) to the data set to better understand the patients in the data set.
# Add descriptions to diagnosis# Read the lines from the filelines <-readLines("./2021-code-descriptions-tabular-order/icd10cm_codes_2021.txt")# Split each line by 2 or more spacessplit_lines <-strsplit(lines, "\\s{2,}")# Find the maximum number of columnsmax_length <-max(sapply(split_lines, length))# Pad shorter rows with NApadded_lines <-lapply(split_lines, function(x) {length(x) <- max_length # Extend the lengthreturn(x) # Returns the padded line})# Convert the list to a data frameicd_code <-do.call(rbind, lapply(padded_lines, function(x) as.data.frame(t(x), stringsAsFactors =FALSE)))# Optionally, set column names if neededcolnames(icd_code) <-c("DX", "Description")alldiag_desc <-left_join(alldiag_2020, icd_code)
Joining with `by = join_by(DX)`
# diagnosis arranged by number of casesalldiag_desc |>group_by(Description) |>summarize(count =n()) |>arrange(desc(count))
# A tibble: 3,814 × 2
Description count
<chr> <int>
1 <NA> 3217815
2 Essential (primary) hypertension 30922
3 Hyperlipidemia, unspecified 27947
4 Acute kidney failure, unspecified 18190
5 Gastro-esophageal reflux disease without esophagitis 17444
6 Single live birth 12200
7 Encounter for immunization 11723
8 Major depressive disorder, single episode, unspecified 11503
9 Anxiety disorder, unspecified 11406
10 Hypo-osmolality and hyponatremia 10649
# ℹ 3,804 more rows
I next filtered patients who were diagnosed with COVID-19 and hospital-associated infections and subsequently plotted by the variables (sex, age, length of stay, discharge month).
#re-level agecovid_patients <- covid_patients |>mutate(age.simplified =cut(AGE, breaks =c(0, 5, 10, 15, 20, seq(30, 100, by =10)), right =FALSE, labels =c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))#count(covid_patients, age.simplified)#count(covid_patients, AGE)#count(covid_patients, SEX)#count(covid_patients, LOS) #length of stay#count(covid_patients, LOS_30DAYS) #greater than 30 days#count(covid_patients, DISCHARGE_MONTH)#count(covid_patients, DISCHARGE_STATUS)
#visualizationggplot(covid_patients, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("COVID-19 cases by sex") +theme_bw()
ggplot(covid_patients, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("COVID-19 cases by age") +theme_bw()
ggplot(covid_patients, aes(x = LOS)) +geom_bar() +labs(x ="Length of stay (up to 14 days)", y ="Count") +ggtitle("COVID-19 cases by length of stay") +theme_bw()
#greater than 15 binned in dataggplot(covid_patients, aes(x = LOS_30DAYS)) +geom_bar() +labs(x ="Length of stay greater than 30 days", y ="Count") +ggtitle("COVID-19 cases by length of stay greater than 30 days") +theme_bw()
ggplot(covid_patients, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("COVID-19 cases by discharge month") +theme_bw()
#subset bacterial pneumonia patientsbactpneumo_patients <-filter(alldiag_2020, DX %in%c("J13", "J14", "J150", "J151", "J1520", "J15211", "J1529", "J153", "J154", "J155", "J1561", "J1569", "J157", "J158", "J159", "J160"))#J13 Pneumonia due to Streptococcus pneumoniae#J14 Pneumonia due to Hemophilus influenzae#J150 Pneumonia due to Klebsiella pneumoniae#J151 Pneumonia due to Pseudomonas#J1520 Pneumonia due to staphylococcus, unspecified#J15211 Pneumonia due to Methicillin susceptible Staphylococcus aureus#J15212 Pneumonia due to Methicillin resistant Staphylococcus aureus#J1529 Pneumonia due to other staphylococcus#J153 Pneumonia due to streptococcus, group B#J154 Pneumonia due to other streptococci#J155 Pneumonia due to Escherichia coli#J1561 Pneumonia due to Acinetobacter baumannii#J1569 Pneumonia due to other Gram-negative bacteria#J157 Pneumonia due to Mycoplasma pneumoniae#J158 Pneumonia due to other specified bacteria#J159 Unspecified bacterial pneumonia#J160 Chlamydial pneumonia
#re-level agebactpneumo_patients <- bactpneumo_patients |>mutate(age.simplified =cut(AGE, breaks =c(0, 5, 10, 15, 20, seq(30, 100, by =10)), right =FALSE, labels =c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))#count(bactpneumo_patients, age.simplified)#count(bactpneumo_patients, SEX)#count(bactpneumo_patients, LOS) #length of stay#count(bactpneumo_patients, LOS_30DAYS) #greater than 30 days#count(bactpneumo_patients, DISCHARGE_MONTH)#count(bactpneumo_patients, DISCHARGE_STATUS)
#visualizationggplot(bactpneumo_patients, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("Bacterial penumonia cases by sex") +theme_bw()
ggplot(bactpneumo_patients, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("Bacterial penumonia cases by age") +theme_bw()
ggplot(bactpneumo_patients, aes(x = LOS)) +geom_bar() +labs(x ="Length of stay (up to 14 days)", y ="Count") +ggtitle("Bacterial penumonia cases by length of stay") +theme_bw()
#greater than 15 binned in dataggplot(bactpneumo_patients, aes(x = LOS_30DAYS)) +geom_bar() +labs(x ="Length of stay greater than 30 days", y ="Count") +ggtitle("Bacterial penumonia cases by length of stay greater than 30 days") +theme_bw()
ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("Bacterial penumonia cases by discharge month") +theme_bw()
#plot number of cases by demographics#re-level ageCdiff_patients <- Cdiff_patients |>mutate(age.simplified =cut(AGE, breaks =c(0, 5, 10, 15, 20, seq(30, 100, by =10)), right =FALSE, labels =c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))#count(Cdiff_patients, age.simplified)#count(Cdiff_patients, SEX)#count(Cdiff_patients, LOS) #length of stay#count(Cdiff_patients, LOS_30DAYS) #greater than 30 days#count(Cdiff_patients, DISCHARGE_MONTH)#count(Cdiff_patients, DISCHARGE_STATUS)
#visualizationggplot(Cdiff_patients, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("C. difficile infection by sex") +theme_bw()
ggplot(Cdiff_patients, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("C. difficile infection by age") +theme_bw()
ggplot(Cdiff_patients, aes(x = LOS)) +geom_bar() +labs(x ="Length of stay (up to 14 days)", y ="Count") +ggtitle("C. difficile infection by length of stay") +theme_bw()
#greater than 15 binned in dataggplot(Cdiff_patients, aes(x = LOS_30DAYS)) +geom_bar() +labs(x ="Length of stay greater than 30 days", y ="Count") +ggtitle("C. difficile infection by length of stay greater than 30 days") +theme_bw()
ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("C. difficile infection by discharge month") +theme_bw()
#MRSAmrsa_patients <-filter(alldiag_2020, DX %in%c("A4102", "A4902", "B9562", "J15212"))#A4102 Sepsis due to Methicillin resistant Staphylococcus aureus#A4902 Methicillin resistant Staphylococcus aureus infection, unspecified site#B9562 Methicillin resistant Staphylococcus aureus infection as the cause of diseases classified elsewhere#J15212 Pneumonia due to Methicillin resistant Staphylococcus aureus# no MRSA cases in dataset
#enterococcusentero_patients <-filter(alldiag_2020, DX %in%c("A4181", "B952"))#A4181 Sepsis due to Enterococcus#B952 Enterococcus as the cause of diseases classified elsewhere
#plot number of cases by demographics#re-level ageentero_patients <- entero_patients |>mutate(age.simplified =cut(AGE, breaks =c(0, 5, 10, 15, 20, seq(30, 100, by =10)), right =FALSE, labels =c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))#count(entero_patients, age.simplified)#count(entero_patients, SEX)#count(entero_patients, LOS) #length of stay#count(entero_patients, LOS_30DAYS) #greater than 30 days#count(entero_patients, DISCHARGE_MONTH)#count(entero_patients, DISCHARGE_STATUS)
#visualizationggplot(entero_patients, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("Enterococcus infections by sex") +theme_bw()
ggplot(entero_patients, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("Enterococcus infections by age") +theme_bw()
ggplot(entero_patients, aes(x = LOS)) +geom_bar() +labs(x ="Length of stay (up to 14 days)", y ="Count") +ggtitle("Enterococcus infections by length of stay") +theme_bw()
#greater than 15 binned in dataggplot(entero_patients, aes(x = LOS_30DAYS)) +geom_bar() +labs(x ="Length of stay greater than 30 days", y ="Count") +ggtitle("Enterococcus infections by length of stay greater than 30 days") +theme_bw()
ggplot(entero_patients, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("Enterococcus infections by discharge month") +theme_bw()
#infections from catheterinfcatheter_patients <-filter(alldiag_2020, DX %in%c("T80211A", "T80211D ", "T80211S", "T80212A", "T80212D", "T80212S", "T80218A", "T80218D", "T80218S", "T80219A", "T80219D", "T80219S"))#T80211A Bloodstream infection due to central venous catheter, initial encounter#T80211D Bloodstream infection due to central venous catheter, subsequent encounter#T80211S Bloodstream infection due to central venous catheter, sequela#T80212A Local infection due to central venous catheter, initial encounter#T80212D Local infection due to central venous catheter, subsequent encounter#T80212S Local infection due to central venous catheter, sequela#T80218A Other infection due to central venous catheter, initial encounter#T80218D Other infection due to central venous catheter, subsequent encounter#T80218S Other infection due to central venous catheter, sequela#T80219A Unspecified infection due to central venous catheter, initial encounter#T80219D Unspecified infection due to central venous catheter, subsequent encounter#T80219S Unspecified infection due to central venous catheter, sequela# no catheter-associated infections in dataset
# ventilator-associated pneumoniaventpenumo_patients <-filter(alldiag_2020, DX =="J95851")#J95851 Ventilator associated pneumonia# no ventilator-associated pneumonia cases in dataset
# surgical site infectionssurginf_patients <-filter(alldiag_2020, DX %in%c("T8141XA", "T8141XD", "T8141XS", "T8142XA", "T8142XD", "T8142XS", "T8143XA", "T8143XD", "T8143XS", "O8600", "O8601", "O8602", "O8603", "O8604", "08609"))#T8141XA Infection following a procedure, superficial incisional surgical site, initial encounter#T8141XD Infection following a procedure, superficial incisional surgical site, subsequent encounter#T8141XS Infection following a procedure, superficial incisional surgical site, sequela#T8142XA Infection following a procedure, deep incisional surgical site, initial encounter#T8142XD Infection following a procedure, deep incisional surgical site, subsequent encounter#T8142XS Infection following a procedure, deep incisional surgical site, sequela#T8143XA Infection following a procedure, organ and space surgical site, initial encounter#T8143XD Infection following a procedure, organ and space surgical site, subsequent encounter#T8143XS Infection following a procedure, organ and space surgical site, sequela#O8600 Infection of obstetric surgical wound, unspecified#O8601 Infection of obstetric surgical wound, superficial incisional site#O8602 Infection of obstetric surgical wound, deep incisional site#O8603 Infection of obstetric surgical wound, organ and space site#O8604 Sepsis following an obstetrical procedure#O8609 Infection of obstetric surgical wound, other surgical site# no surgical site infection cases in dataset
There were no cases of infections associated with surgeries, infections from catheter use, methicillin-resistant Staphylococcus aureus infections, or ventilator-associated pneumonia in the dataset.
#plot month data together #drop a color for COVID to make uniformcolors_set3 <-brewer.pal(n =12, name ="Set3")[-1]covid_month <-ggplot(covid_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_manual(values = colors_set3) +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")#adjust margins for cowplotcovid_month <- covid_month +theme(plot.margin =margin(t =20, r =10, b =10, l =10))bactpneumo_month <-ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")#adjust margins for cowplotbactpneumo_month <- bactpneumo_month +theme(plot.margin =margin(t =20, r =10, b =10, l =10))cdiff_month <-ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")#adjust margins for cowplotcdiff_month <- cdiff_month +theme(plot.margin =margin(t =20, r =10, b =10, l =10))entero_month <-ggplot(entero_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")entero_month <- entero_month +theme(plot.margin =margin(t =20, r =10, b =10, l =10))plot_grid(covid_month, bactpneumo_month, cdiff_month, entero_month, labels =c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size =12, label_x =-0.05, label_y =1)
Next, I analyzed for association between discharge month and infection.
# determine whether discharge month is associated with COVID # filter DISCHARGE_MONTH and DX columnsdx_month <- alldiag_2020 |>select(c(DISCHARGE_MONTH, DX))# Remove NA in DXdx_month_narm <- dx_month[!is.na(dx_month$DX), ]# Remove NA in Discharge Monthdx_month_narm <- dx_month_narm[!is.na(dx_month_narm$DISCHARGE_MONTH), ]# create new column with COVID infection as factorcovid_binary_month <- dx_month_narm |>mutate(COVID =ifelse(DX =="U071", 1, 0))# month as factorcovid_binary_month$DISCHARGE_MONTH <-factor( covid_binary_month$DISCHARGE_MONTH, levels =1:12, labels = month.name)# relevel to set reference month to Aprilcovid_binary_month$DISCHARGE_MONTH <-fct_relevel(covid_binary_month$DISCHARGE_MONTH, "April")# Fit the logistic modelcovid.month.fit <-glm(COVID ~ DISCHARGE_MONTH, data = covid_binary_month, family = binomial)summary(covid.month.fit)
Call:
glm(formula = COVID ~ DISCHARGE_MONTH, family = binomial, data = covid_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.33340 0.02855 -151.786 < 2e-16 ***
DISCHARGE_MONTHJanuary -7.50209 1.00041 -7.499 6.43e-14 ***
DISCHARGE_MONTHFebruary -16.23267 49.74714 -0.326 0.7442
DISCHARGE_MONTHMarch -5.76748 0.44813 -12.870 < 2e-16 ***
DISCHARGE_MONTHMay -0.60721 0.04544 -13.363 < 2e-16 ***
DISCHARGE_MONTHJune -1.31565 0.05666 -23.221 < 2e-16 ***
DISCHARGE_MONTHJuly -1.35050 0.05550 -24.334 < 2e-16 ***
DISCHARGE_MONTHAugust -1.54094 0.05988 -25.735 < 2e-16 ***
DISCHARGE_MONTHSeptember -1.90079 0.06933 -27.416 < 2e-16 ***
DISCHARGE_MONTHOctober -1.36564 0.05500 -24.829 < 2e-16 ***
DISCHARGE_MONTHNovember -0.58515 0.04379 -13.364 < 2e-16 ***
DISCHARGE_MONTHDecember -0.09776 0.03829 -2.553 0.0107 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83171 on 1498051 degrees of freedom
Residual deviance: 76909 on 1498040 degrees of freedom
AIC: 76933
Number of Fisher Scoring iterations: 19
# evaluate logistic regression # set enginelr_cls_spec <-logistic_reg() |>set_engine("glm")covid_binary_month_factor <- covid_binary_month |>mutate(DISCHARGE_MONTH =as.factor(DISCHARGE_MONTH))covid_binary_month_factor <- covid_binary_month_factor |>mutate(COVID =as.factor(COVID))# fit model lr_cls_fit_covid_month <- lr_cls_spec |>fit(COVID ~ DISCHARGE_MONTH, data = covid_binary_month_factor)lr_cls_fit_covid_month
Bacterial pneumonia was negatively associated with discharge months Jan, Feb, and Jun~Oct and positively associated with April.
The model performed poorly at predicting bacterial pneumonia by month, only slightly better than random chance (0.5).
# determine whether discharge month is associated with C. difficile infection # create new column with C. diff infection as factorcdiff_binary_month <- alldiag_2020 |>mutate(cdiff =ifelse(DX =="A047", 1, 0))# Fit the linear modelcdiff.month.fit <-glm(cdiff ~factor(DISCHARGE_MONTH), data = cdiff_binary_month, family = binomial)summary(cdiff.month.fit)
Call:
glm(formula = cdiff ~ factor(DISCHARGE_MONTH), family = binomial,
data = cdiff_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.161890 0.096711 -74.054 <2e-16 ***
factor(DISCHARGE_MONTH)2 -0.182908 0.146820 -1.246 0.2128
factor(DISCHARGE_MONTH)3 -0.015104 0.141824 -0.106 0.9152
factor(DISCHARGE_MONTH)4 -0.150278 0.158077 -0.951 0.3418
factor(DISCHARGE_MONTH)5 -0.228811 0.153778 -1.488 0.1368
factor(DISCHARGE_MONTH)6 -0.097025 0.145829 -0.665 0.5058
factor(DISCHARGE_MONTH)7 -0.287661 0.150062 -1.917 0.0552 .
factor(DISCHARGE_MONTH)8 -0.315877 0.151853 -2.080 0.0375 *
factor(DISCHARGE_MONTH)9 -0.003822 0.139506 -0.027 0.9781
factor(DISCHARGE_MONTH)10 -0.263541 0.147330 -1.789 0.0736 .
factor(DISCHARGE_MONTH)11 -0.150290 0.145828 -1.031 0.3027
factor(DISCHARGE_MONTH)12 -0.259636 0.148387 -1.750 0.0802 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16506 on 1498051 degrees of freedom
Residual deviance: 16494 on 1498040 degrees of freedom
(2482768 observations deleted due to missingness)
AIC: 16518
Number of Fisher Scoring iterations: 10
# evaluate the logistic regression modelcdiff_binary_month_factor <- cdiff_binary_month |>mutate(DISCHARGE_MONTH =as.factor(DISCHARGE_MONTH))cdiff_binary_month_factor <- cdiff_binary_month_factor |>mutate(cdiff =as.factor(cdiff))# fit model lr_cls_fit_cdiff_month <- lr_cls_spec |>fit(cdiff ~ DISCHARGE_MONTH, data = cdiff_binary_month_factor)lr_cls_fit_cdiff_month
parsnip model object
Call: stats::glm(formula = cdiff ~ DISCHARGE_MONTH, family = stats::binomial,
data = data)
Coefficients:
(Intercept) DISCHARGE_MONTH2 DISCHARGE_MONTH3 DISCHARGE_MONTH4
-7.161890 -0.182908 -0.015104 -0.150278
DISCHARGE_MONTH5 DISCHARGE_MONTH6 DISCHARGE_MONTH7 DISCHARGE_MONTH8
-0.228811 -0.097025 -0.287661 -0.315877
DISCHARGE_MONTH9 DISCHARGE_MONTH10 DISCHARGE_MONTH11 DISCHARGE_MONTH12
-0.003822 -0.263541 -0.150290 -0.259636
Degrees of Freedom: 1498051 Total (i.e. Null); 1498040 Residual
(2482768 observations deleted due to missingness)
Null Deviance: 16510
Residual Deviance: 16490 AIC: 16520
# model predictive abilitylr_training_pred_cdiff_month <-bind_cols(truth = cdiff_binary_month_factor$cdiff,predict(lr_cls_fit_cdiff_month, cdiff_binary_month_factor),predict(lr_cls_fit_cdiff_month, cdiff_binary_month_factor, type ="prob"))lr_training_pred_cdiff_month
Only January and August was negatively associated with C. difficile infection indicating that there is no seasonality in cases.
Model performed poorly in predicting C. difficile infection by discharge month.
# determine whether discharge month is associated with Enterococcus infection entero_code <-c("A4181", "B952")# create new column with Enterococcus infection as factorentero_binary_month <- alldiag_2020 |>mutate(entero =ifelse(DX %in% entero_code, 1, 0))# Fit the linear modelentero.month.fit <-glm(entero ~factor(DISCHARGE_MONTH), data = entero_binary_month, family = binomial)summary(entero.month.fit)
Call:
glm(formula = entero ~ factor(DISCHARGE_MONTH), family = binomial,
data = entero_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.88586 0.13868 -64.072 <2e-16 ***
factor(DISCHARGE_MONTH)2 -0.16754 0.20887 -0.802 0.4225
factor(DISCHARGE_MONTH)3 -0.04716 0.20485 -0.230 0.8179
factor(DISCHARGE_MONTH)4 -0.07022 0.22057 -0.318 0.7502
factor(DISCHARGE_MONTH)5 0.19973 0.19709 1.013 0.3109
factor(DISCHARGE_MONTH)6 -0.11961 0.21184 -0.565 0.5723
factor(DISCHARGE_MONTH)7 0.15738 0.19260 0.817 0.4139
factor(DISCHARGE_MONTH)8 -0.41638 0.22692 -1.835 0.0665 .
factor(DISCHARGE_MONTH)9 0.01575 0.20128 0.078 0.9376
factor(DISCHARGE_MONTH)10 -0.02964 0.20017 -0.148 0.8823
factor(DISCHARGE_MONTH)11 0.02457 0.20242 0.121 0.9034
factor(DISCHARGE_MONTH)12 0.13403 0.19520 0.687 0.4923
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10733 on 3980819 degrees of freedom
Residual deviance: 10721 on 3980808 degrees of freedom
AIC: 10745
Number of Fisher Scoring iterations: 12
# evaluate the logistic regression modelentero_binary_month_factor <- entero_binary_month |>mutate(DISCHARGE_MONTH =as.factor(DISCHARGE_MONTH))entero_binary_month_factor <- entero_binary_month_factor |>mutate(entero =as.factor(entero))# fit model lr_cls_fit_entero_month <- lr_cls_spec |>fit(entero ~ DISCHARGE_MONTH, data = entero_binary_month_factor)lr_cls_fit_entero_month
Similar to C. difficle, Enterococcus infection was only negatively associated with January and August indicating no seasonality.
The logistic regression model performed poorly in predicting Enterococcus infection by discharge month.
Bacterial pneumonia follows a similar seasonality as COVID-19 as expected, but C. difficile and Enterococcus infections appears to be pretty consistent throughout the year.
Next, I looked at the age distribution of each infection.
#plot age data together covid_age <-ggplot(covid_patients, aes(x = age.simplified, fill = age.simplified)) +geom_bar() +scale_fill_brewer(palette ="Set3") +labs(x ="Age", y ="Count") +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1),legend.position ="none")#adjust margins for cowplotcovid_age <- covid_age +theme(plot.margin =margin(t =20, r =10, b =10, l =10))bactpneumo_age <-ggplot(bactpneumo_patients, aes(x = age.simplified, fill = age.simplified)) +geom_bar() +scale_fill_brewer(palette ="Set3") +labs(x ="Age", y ="Count") +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1),legend.position ="none")#adjust margins for cowplotbactpneumo_age <- bactpneumo_age +theme(plot.margin =margin(t =20, r =10, b =10, l =10))cdiff_age <-ggplot(Cdiff_patients, aes(x = age.simplified, fill = age.simplified)) +geom_bar() +scale_fill_brewer(palette ="Set3") +labs(x ="Age", y ="Count") +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1),legend.position ="none")#adjust margins for cowplotcdiff_age <- cdiff_age +theme(plot.margin =margin(t =20, r =10, b =10, l =10))entero_age <-ggplot(entero_patients, aes(x = age.simplified, fill = age.simplified)) +geom_bar() +scale_fill_brewer(palette ="Set3") +labs(x ="Age", y ="Count") +theme_bw() +theme(axis.text.x =element_text(angle =45, hjust =1),legend.position ="none")entero_age <- entero_age +theme(plot.margin =margin(t =20, r =10, b =10, l =10))plot_grid(covid_age, bactpneumo_age, cdiff_age, entero_age, labels =c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size =12, label_x =-0.05, label_y =1)
# Determine whether Age is associated with COVID-19# filter DISCHARGE_MONTH and DX columnsdx_age <- alldiag_desc |>select(c(AGE, DX))# Remove NA in DXdx_age_narm <- dx_age[!is.na(dx_age$DX), ]# Remove NA in Discharge Monthdx_age_narm <- dx_age_narm[!is.na(dx_age_narm$AGE), ]# create new column with COVID infection as factorcovid_binary_age <- dx_age_narm |>mutate(COVID =ifelse(DX =="U071", 1, 0))# Fit the logistic modelcovid.age.fit <-glm(COVID ~ AGE, data = covid_binary_age, family = binomial)summary(covid.age.fit)
Call:
glm(formula = COVID ~ AGE, family = binomial, data = covid_binary_age)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9510096 0.0380230 -156.51 <2e-16 ***
AGE 0.0086774 0.0005929 14.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83128 on 1496830 degrees of freedom
Residual deviance: 82901 on 1496829 degrees of freedom
AIC: 82905
Number of Fisher Scoring iterations: 8
# Bacterial pneumonia # create new column with COVID infection as factorbactpneumo_binary_age <- dx_age_narm |>mutate(bactpneumo =ifelse(DX %in% bactpneumo_code, 1, 0))# Fit the logistic modelbactpneumo.age.fit <-glm(bactpneumo ~ AGE, data = bactpneumo_binary_age, family = binomial)summary(bactpneumo.age.fit)
Call:
glm(formula = bactpneumo ~ AGE, family = binomial, data = bactpneumo_binary_age)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.241325 0.073380 -98.682 < 2e-16 ***
AGE 0.007636 0.001150 6.637 3.19e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26072 on 1496830 degrees of freedom
Residual deviance: 26025 on 1496829 degrees of freedom
AIC: 26029
Number of Fisher Scoring iterations: 9
# C. difficile# create new column with COVID infection as factorcdiff_binary_age <- dx_age_narm |>mutate(cdiff =ifelse(DX =="A047", 1, 0))# Fit the logistic modelcdiff.age.fit <-glm(cdiff ~ AGE, data = cdiff_binary_age, family = binomial)summary(cdiff.age.fit)
Call:
glm(formula = cdiff ~ AGE, family = binomial, data = cdiff_binary_age)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.982675 0.100811 -79.185 < 2e-16 ***
AGE 0.011227 0.001551 7.241 4.46e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16505 on 1496830 degrees of freedom
Residual deviance: 16448 on 1496829 degrees of freedom
AIC: 16452
Number of Fisher Scoring iterations: 10
# Enterococcus# create new column with COVID infection as factorentero_binary_age <- dx_age_narm |>mutate(entero =ifelse(DX %in% entero_code, 1, 0))# Fit the logistic modelentero.age.fit <-glm(entero ~ AGE, data = entero_binary_age, family = binomial)summary(entero.age.fit)
Call:
glm(formula = entero ~ AGE, family = binomial, data = entero_binary_age)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.563859 0.135560 -63.174 < 2e-16 ***
AGE 0.010844 0.002089 5.191 2.09e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9673.0 on 1496830 degrees of freedom
Residual deviance: 9643.9 on 1496829 degrees of freedom
AIC: 9647.9
Number of Fisher Scoring iterations: 11
# evaluate predictive performance of logistic regression for covidcovid_binary_age_factor <- covid_binary_age |>mutate(AGE =as.factor(AGE))covid_binary_age_factor <- covid_binary_age_factor |>mutate(COVID =as.factor(COVID))# fit model lr_cls_fit_covid_age <- lr_cls_spec |>fit(COVID ~ AGE, data = covid_binary_age_factor)lr_cls_fit_covid_age
# evaluate predictive performance of logistic regression for C. difficilecdiff_binary_age_factor <- cdiff_binary_age |>mutate(AGE =as.factor(AGE))cdiff_binary_age_factor <- cdiff_binary_age_factor |>mutate(cdiff =as.factor(cdiff))# fit model lr_cls_fit_cdiff_age <- lr_cls_spec |>fit(cdiff ~ AGE, data = cdiff_binary_age_factor)lr_cls_fit_cdiff_age
All 4 infections predominantly affects older individuals (highest 60-69).
Despite this, the predictive performance of the logistic model was poor and performed only slightly better than random chance.
# proportions by discharge month and ageggplot(covid_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +geom_bar(position ="fill") +scale_fill_brewer(palette ="Spectral") +scale_x_continuous(breaks =1:12) +labs(x ="Discharge Month", y ="Proportion", fill ="Age") +ggtitle("COVID-19 cases") +theme_bw()
ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +geom_bar(position ="fill") +scale_fill_brewer(palette ="Spectral") +scale_x_continuous(breaks =1:12) +labs(x ="Discharge Month", y ="Proportion", fill ="Age") +ggtitle("Bacterial pneumonia cases") +theme_bw()
ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +geom_bar(position ="fill") +scale_fill_brewer(palette ="Spectral") +scale_x_continuous(breaks =1:12) +labs(x ="Discharge Month", y ="Proportion", fill ="Age") +ggtitle("C. difficile infections") +theme_bw()
ggplot(entero_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +geom_bar(position ="fill") +scale_fill_brewer(palette ="Spectral") +scale_x_continuous(breaks =1:12) +labs(x ="Discharge Month", y ="Proportion", fill ="Age") +ggtitle("Enterococcus infections") +theme_bw()
# expand on bacterial pneumonia for kids under 10bactpneumo_under10 <- bactpneumo_patients |>filter(AGE <10)bactpneumo_under10_plot <-ggplot(bactpneumo_under10, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +xlab("Discharge Month") +ggtitle("Bacterial pneumonia cases under 10 years old") +theme_bw() +theme(legend.position ="none")bactpneumo_under10_plot
# determine whether discharge month is associated with bacterial pneumonia under 10alldiag_2020_under10 <- alldiag_2020 |>filter(AGE <10)bactpneumo_under10_binary_month <- alldiag_2020_under10 |>mutate(bactpneumo =ifelse(DX %in% bactpneumo_code, 1, 0))# Fit the linear modelbactpneumo.under10.month.fit <-glm(bactpneumo ~factor(DISCHARGE_MONTH), data = bactpneumo_under10_binary_month, family = binomial)summary(bactpneumo.under10.month.fit)
Call:
glm(formula = bactpneumo ~ factor(DISCHARGE_MONTH), family = binomial,
data = bactpneumo_under10_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.3202 0.1716 -42.669 < 2e-16 ***
factor(DISCHARGE_MONTH)2 -0.6339 0.2971 -2.134 0.032873 *
factor(DISCHARGE_MONTH)3 -0.6580 0.3032 -2.170 0.030022 *
factor(DISCHARGE_MONTH)4 -1.4468 0.4429 -3.267 0.001087 **
factor(DISCHARGE_MONTH)5 -17.2459 642.2556 -0.027 0.978578
factor(DISCHARGE_MONTH)6 -3.3316 1.0146 -3.284 0.001025 **
factor(DISCHARGE_MONTH)7 -1.7686 0.4790 -3.692 0.000222 ***
factor(DISCHARGE_MONTH)8 -3.4035 1.0146 -3.354 0.000795 ***
factor(DISCHARGE_MONTH)9 -2.2344 0.6023 -3.710 0.000208 ***
factor(DISCHARGE_MONTH)10 -17.2459 629.0164 -0.027 0.978127
factor(DISCHARGE_MONTH)11 -1.9128 0.5286 -3.618 0.000296 ***
factor(DISCHARGE_MONTH)12 -2.1338 0.6023 -3.543 0.000396 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1740.3 on 523499 degrees of freedom
Residual deviance: 1630.3 on 523488 degrees of freedom
AIC: 1654.3
Number of Fisher Scoring iterations: 23
The proportion of kids under 10 who were diagnosed for bacterial pneumonia decreased after the first few months.
I next looked at the sex distribution of infection.
# plot infection by sex togethercovid_sex <-ggplot(covid_patients, aes(x = SEX, fill = SEX)) +geom_bar() +scale_fill_manual(values =c("Male"="blue", "Female"="red")) +labs(x ="Sex", y ="Count") +ggtitle("COVID-19") +theme_bw() +theme(legend.position ="none")bactpneumo_sex <-ggplot(bactpneumo_patients, aes(x = SEX, fill = SEX)) +geom_bar() +scale_fill_manual(values =c("Male"="blue", "Female"="red")) +labs(x ="Sex", y ="Count") +ggtitle("Bacterial penumonia") +theme_bw() +theme(legend.position ="none")cdiff_sex <-ggplot(Cdiff_patients, aes(x = SEX, fill = SEX)) +geom_bar() +scale_fill_manual(values =c("Male"="blue", "Female"="red")) +labs(x ="Sex", y ="Count") +ggtitle("C. difficile") +theme_bw() +theme(legend.position ="none")entero_sex <-ggplot(entero_patients, aes(x = SEX, fill = SEX)) +geom_bar() +scale_fill_manual(values =c("Male"="blue", "Female"="red")) +labs(x ="Sex", y ="Count") +ggtitle("Enterococcus") +theme_bw() +theme(legend.position ="none")plot_grid(covid_sex, bactpneumo_sex, cdiff_sex, entero_sex, label_size =12, label_x =-0.05, label_y =1)
# determine whether sex is assocated with COVID-19 diagnosis# filter DISCHARGE_MONTH and DX columnsdx_sex <- alldiag_2020 |>select(c(SEX, DX))# Remove NA in DXdx_sex_narm <- dx_sex[!is.na(dx_sex$DX), ]# Remove NA in Sexdx_sex_narm <- dx_sex_narm[!is.na(dx_sex_narm$SEX), ]# create new column with COVID infection as factorcovid_binary_sex <- dx_sex_narm |>mutate(COVID =ifelse(DX =="U071", 1, 0))# Fit the logistic modelcovid.sex.fit <-glm(COVID ~factor(SEX), data = covid_binary_sex, family = binomial)summary(covid.sex.fit)
Call:
glm(formula = COVID ~ factor(SEX), family = binomial, data = covid_binary_sex)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.35494 0.01727 -310.132 < 2e-16 ***
factor(SEX)Female -0.17561 0.02498 -7.029 2.07e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83138 on 1496739 degrees of freedom
Residual deviance: 83089 on 1496738 degrees of freedom
AIC: 83093
Number of Fisher Scoring iterations: 8
Female sex is significantly negatively associated (0.839 fold) with COVID-19 compared to males.
# determine whether sex is assocated with bacterial pneumonia diagnosis# create new column with COVID infection as factorbactpneumo_binary_sex <- dx_sex_narm |>mutate(bactpneumo =ifelse(DX %in% bactpneumo_code, 1, 0))# Fit the logistic modelbactpneumo.sex.fit <-glm(bactpneumo ~factor(SEX), data = bactpneumo_binary_sex, family = binomial)summary(bactpneumo.sex.fit)
Call:
glm(formula = bactpneumo ~ factor(SEX), family = binomial, data = bactpneumo_binary_sex)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.60264 0.03211 -205.607 <2e-16 ***
factor(SEX)Female -0.41114 0.04958 -8.292 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26071 on 1496739 degrees of freedom
Residual deviance: 26002 on 1496738 degrees of freedom
AIC: 26006
Number of Fisher Scoring iterations: 9
Female sex is also significantly negatively associated (0.662 fold) with bacterial pneumonia diagnosis compared to males.
# determine whether sex is assocated with C. difficile infection # create new column with COVID infection as factorcdiff_binary_sex <- dx_sex_narm |>mutate(cdiff =ifelse(DX =="A047", 1, 0))# Fit the logistic modelcdiff.sex.fit <-glm(cdiff ~factor(SEX), data = cdiff_binary_sex, family = binomial)summary(cdiff.sex.fit)
Call:
glm(formula = cdiff ~ factor(SEX), family = binomial, data = cdiff_binary_sex)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.33963 0.04639 -158.220 <2e-16 ***
factor(SEX)Female 0.04032 0.06365 0.634 0.526
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16504 on 1496739 degrees of freedom
Residual deviance: 16504 on 1496738 degrees of freedom
AIC: 16508
Number of Fisher Scoring iterations: 10
C. difficile infection is not associated with a particular sex.
# determine whether sex is assocated with Enterococcus infection # create new column with COVID infection as factorentero_binary_sex <- dx_sex_narm |>mutate(entero =ifelse(DX %in% entero_code, 1, 0))# Fit the logistic modelentero.sex.fit <-glm(entero ~factor(SEX), data = entero_binary_sex, family = binomial)summary(entero.sex.fit)
Call:
glm(formula = entero ~ factor(SEX), family = binomial, data = entero_binary_sex)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.79833 0.05833 -133.692 < 2e-16 ***
factor(SEX)Female -0.25511 0.08622 -2.959 0.00309 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9672.9 on 1496739 degrees of freedom
Residual deviance: 9664.1 on 1496738 degrees of freedom
AIC: 9668.1
Number of Fisher Scoring iterations: 10
# Fit the linear modelcovid.bactpneumo.fit <-glm(bactpneumo ~ COVID, data = alldiag_bactpneumo_binary, family = binomial)summary(covid.bactpneumo.fit)
Call:
glm(formula = bactpneumo ~ COVID, family = binomial, data = alldiag_bactpneumo_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.2344 0.1049 -68.99 < 2e-16 ***
COVID 1.6850 0.2262 7.45 9.33e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1865.7 on 132693 degrees of freedom
Residual deviance: 1826.3 on 132692 degrees of freedom
AIC: 1830.3
Number of Fisher Scoring iterations: 10
# adjust for sexcovid.bactpneumo.sex.fit <-glm(bactpneumo ~ SEX + COVID, data = alldiag_bactpneumo_binary, family = binomial)summary(covid.bactpneumo.sex.fit)
Call:
glm(formula = bactpneumo ~ SEX + COVID, family = binomial, data = alldiag_bactpneumo_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.9436 0.1291 -53.777 < 2e-16 ***
SEXFemale -0.6187 0.1917 -3.227 0.00125 **
COVID 1.6437 0.2265 7.257 3.97e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1865.5 on 132588 degrees of freedom
Residual deviance: 1815.3 on 132586 degrees of freedom
(105 observations deleted due to missingness)
AIC: 1821.3
Number of Fisher Scoring iterations: 10
# adjust for agecovid.bactpneumo.age.fit <-glm(bactpneumo ~ AGE + COVID, data = alldiag_bactpneumo_binary, family = binomial)summary(covid.bactpneumo.age.fit)
Call:
glm(formula = bactpneumo ~ AGE + COVID, family = binomial, data = alldiag_bactpneumo_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.052699 0.262426 -30.686 < 2e-16 ***
AGE 0.015408 0.004154 3.709 0.000208 ***
COVID 1.531408 0.228352 6.706 2e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1865.6 on 132613 degrees of freedom
Residual deviance: 1811.0 on 132611 degrees of freedom
(80 observations deleted due to missingness)
AIC: 1817
Number of Fisher Scoring iterations: 10
# evaluate the predictive ability# set enginelr_cls_spec <-logistic_reg() |>set_engine("glm", family = binomial)# fit model lr_cls_fit_bactpneumo_covid <- lr_cls_spec |>fit(bactpneumo ~ COVID, data = alldiag_bactpneumo_factor)lr_cls_fit_bactpneumo_covid
parsnip model object
Call: stats::glm(formula = bactpneumo ~ COVID, family = ~binomial,
data = data)
Coefficients:
(Intercept) COVIDPositive
-7.234 1.685
Degrees of Freedom: 132693 Total (i.e. Null); 132692 Residual
Null Deviance: 1866
Residual Deviance: 1826 AIC: 1830
# model predictive abilitylr_training_pred_bactpneumo_covid <-bind_cols(truth = alldiag_bactpneumo_factor$bactpneumo,predict(lr_cls_fit_bactpneumo_covid, alldiag_bactpneumo_factor),predict(lr_cls_fit_bactpneumo_covid, alldiag_bactpneumo_factor, type ="prob"))lr_training_pred_bactpneumo_covid
A positive coefficient (1.69) and a p-value less than 0.001 indicates that there is a very strong positive association between bacterial pneumonia and COVID diagnosis, regardless of sex or age.
However, the predictive ability of the logistic regression is poor.
# determine whether COVID is associated with C. difficile colitisalldiag_cdiff_binary <- alldiag_covid_binary |>mutate(Cdiff =ifelse(rowSums(select(alldiag_covid_binary, 21:50) =="A047", na.rm =TRUE) >0, 1, 0))alldiag_cdiff_factor <- alldiag_cdiff_binary %>%mutate(COVID =factor(COVID, levels =c(0, 1), labels =c("Negative", "Positive")),Cdiff =factor(Cdiff, levels =c(0,1), labels =c("Negative", "Positive")))# Create the plot with labeled factorsggplot(alldiag_cdiff_factor, aes(x = Cdiff, fill = COVID)) +geom_bar(position ="fill") +scale_fill_manual(values =c("Negative"="#aaaaaa", "Positive"="#b20000")) +theme_bw() +labs(x ="C. difficile infection", y ="Proportion", fill ="COVID Status")
# Fit the linear modelcovid.cdiff.fit <-glm(Cdiff ~ COVID, data = alldiag_cdiff_binary, family = binomial)summary(covid.cdiff.fit)
Call:
glm(formula = Cdiff ~ COVID, family = binomial, data = alldiag_cdiff_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.89046 0.03270 -149.548 <2e-16 ***
COVID 0.03812 0.14568 0.262 0.794
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11690 on 132693 degrees of freedom
Residual deviance: 11690 on 132692 degrees of freedom
AIC: 11694
Number of Fisher Scoring iterations: 7
# adjust for sexcovid.cdiff.sex.fit <-glm(Cdiff ~ SEX + COVID, data = alldiag_cdiff_binary, family = binomial)summary(covid.cdiff.sex.fit)
Call:
glm(formula = Cdiff ~ SEX + COVID, family = binomial, data = alldiag_cdiff_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.86594 0.04729 -102.888 <2e-16 ***
SEXFemale -0.04387 0.06389 -0.687 0.492
COVID 0.03463 0.14575 0.238 0.812
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11689 on 132588 degrees of freedom
Residual deviance: 11688 on 132586 degrees of freedom
(105 observations deleted due to missingness)
AIC: 11694
Number of Fisher Scoring iterations: 7
# adjust for agecovid.cdiff.age.fit <-glm(Cdiff ~ AGE + COVID, data = alldiag_cdiff_binary, family = binomial)summary(covid.cdiff.age.fit)
Call:
glm(formula = Cdiff ~ AGE + COVID, family = binomial, data = alldiag_cdiff_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.118076 0.096028 -63.711 <2e-16 ***
AGE 0.022290 0.001472 15.145 <2e-16 ***
COVID -0.162522 0.146031 -1.113 0.266
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11689 on 132613 degrees of freedom
Residual deviance: 11419 on 132611 degrees of freedom
(80 observations deleted due to missingness)
AIC: 11425
Number of Fisher Scoring iterations: 8
There is no association between C. difficile colitis and COVID-19 diagnosis.
# determine whether COVID diagnosis is associated with Enterococcus infectionalldiag_entero_binary <- alldiag_covid_binary |>mutate(Entero =ifelse(rowSums(select(alldiag_covid_binary, 21:50) == entero_code, na.rm =TRUE) >0, 1, 0))alldiag_entero_factor <- alldiag_entero_binary %>%mutate(COVID =factor(COVID, levels =c(0, 1), labels =c("Negative", "Positive")),Entero =factor(Entero, levels =c(0,1), labels =c("Negative", "Positive")))# Create the plot with labeled factorsggplot(alldiag_entero_factor, aes(x = Entero, fill = COVID)) +geom_bar(position ="fill") +scale_fill_manual(values =c("Negative"="#aaaaaa", "Positive"="#b20000")) +theme_bw() +labs(x ="Enterococcus infection", y ="Proportion", fill ="COVID Status")
# Fit the linear modelcovid.entero.fit <-glm(Entero ~ COVID, data = alldiag_entero_binary, family = binomial)summary(covid.entero.fit)
Call:
glm(formula = Entero ~ COVID, family = binomial, data = alldiag_entero_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.21056 0.06293 -98.686 <2e-16 ***
COVID 0.21349 0.25810 0.827 0.408
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3873.6 on 132693 degrees of freedom
Residual deviance: 3873.0 on 132692 degrees of freedom
AIC: 3877
Number of Fisher Scoring iterations: 9
# adjust for sexcovid.entero.sex.fit <-glm(Entero ~ SEX + COVID, data = alldiag_entero_binary, family = binomial)summary(covid.entero.sex.fit)
Call:
glm(formula = Entero ~ SEX + COVID, family = binomial, data = alldiag_entero_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.01866 0.08378 -71.837 < 2e-16 ***
SEXFemale -0.38381 0.12284 -3.124 0.00178 **
COVID 0.18710 0.25824 0.725 0.46873
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3873.2 on 132588 degrees of freedom
Residual deviance: 3862.7 on 132586 degrees of freedom
(105 observations deleted due to missingness)
AIC: 3868.7
Number of Fisher Scoring iterations: 9
There is no association between Enterococus infection and COVID-19 diagnosis.
Age is positively associated with COVID-19 diagnosis.
Emergency department data set
I also looked at the emergency department data set. Note, I did not include this in the results section as majority of patients (>80%) did not stay more than a day.
# Read in NHCS 2020 emergency department R Dataseturl2 <-"https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHCS/2020/R/nhcs2020ed_r.rds"nhcs2020ed <-read_rds(url2)
# emergency department dataset#pull out variables listvar_2020_ed <- nhcs2020ed$variables#select useful columnsvar_2020_ed_select <-select(var_2020_ed, 1:37)#variable names varnames_2020_ed <-colnames(var_2020_ed_select)#pull out primary diagnosis (DX1)diag1_2020_ed <- var_2020_ed_select$DX1#determine the number of primary C. diff infections (ICD-10 diagnosis code: A047)primaryCdiff_2020_ed <-sum(diag1_2020_ed =="A047", na.rm =TRUE)
# list primary diagnosis by countvar_2020_ed_select |>group_by(DX1) |>summarize(count =n()) |>arrange(desc(count))
# the code doesn't match exactly, for e.g. R078 has R0781, R0782, R0789 so can't accurately add descriptions. Left as is. The highest counts are chest pain, COVID-19, respiratory infection, joint pain, abdominal pain.
I next determined the proportion of emergency department cases that was “routine to home”, i.e. did not stay more than 1 day.
# A tibble: 9 × 2
DISCHARGE_STATUS count
<fct> <int>
1 Routine to home 314022
2 <NA> 27915
3 Other 22527
4 Left against medical advice 7554
5 Home health care 6093
6 Transfer to short term facility 5279
7 Transfer to long term facility 2115
8 Dead 2063
9 Hospice care - home or medical facility 1185
#visualizationggplot(covid_patients_ed, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("COVID-19 cases by sex") +theme_bw()
ggplot(covid_patients_ed, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("COVID-19 cases by age") +theme_bw()
ggplot(covid_patients_ed, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("COVID-19 cases by discharge month") +theme_bw()
#subset bacterial pneumonia patientsbactpneumo_patients_ed <-filter(alldiag_2020_ed, DX %in%c("J13", "J14", "J150", "J151", "J1520", "J15211", "J1529", "J153", "J154", "J155", "J1561", "J1569", "J157", "J158", "J159", "J160"))#J13 Pneumonia due to Streptococcus pneumoniae#J14 Pneumonia due to Hemophilus influenzae#J150 Pneumonia due to Klebsiella pneumoniae#J151 Pneumonia due to Pseudomonas#J1520 Pneumonia due to staphylococcus, unspecified#J15211 Pneumonia due to Methicillin susceptible Staphylococcus aureus#J15212 Pneumonia due to Methicillin resistant Staphylococcus aureus#J1529 Pneumonia due to other staphylococcus#J153 Pneumonia due to streptococcus, group B#J154 Pneumonia due to other streptococci#J155 Pneumonia due to Escherichia coli#J1561 Pneumonia due to Acinetobacter baumannii#J1569 Pneumonia due to other Gram-negative bacteria#J157 Pneumonia due to Mycoplasma pneumoniae#J158 Pneumonia due to other specified bacteria#J159 Unspecified bacterial pneumonia#J160 Chlamydial pneumonia
#plot number of cases by demographics#re-level ageCdiff_patients_ed <- Cdiff_patients_ed |>mutate(age.simplified =cut(AGE, breaks =c(0, 5, 10, 15, 20, seq(30, 100, by =10)), right =FALSE, labels =c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))#count(Cdiff_patients_ed, age.simplified)#count(Cdiff_patients_ed, SEX)#count(Cdiff_patients_ed, DISCHARGE_MONTH)#count(Cdiff_patients_ed, DISCHARGE_STATUS)
#visualizationggplot(Cdiff_patients_ed, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("C. difficile infection by sex") +theme_bw()
ggplot(Cdiff_patients_ed, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("C. difficile infection by age") +theme_bw()
ggplot(Cdiff_patients_ed, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("C. difficile infection by discharge month") +theme_bw()
#MRSAmrsa_patients_ed <-filter(alldiag_2020_ed, DX %in%c("A4102", "A4902", "B9562", "J15212"))#A4102 Sepsis due to Methicillin resistant Staphylococcus aureus#A4902 Methicillin resistant Staphylococcus aureus infection, unspecified site#B9562 Methicillin resistant Staphylococcus aureus infection as the cause of diseases classified elsewhere#J15212 Pneumonia due to Methicillin resistant Staphylococcus aureus# no MRSA cases in dataset
#enterococcusentero_patients_ed <-filter(alldiag_2020_ed, DX %in%c("A4181", "B952"))#A4181 Sepsis due to Enterococcus#B952 Enterococcus as the cause of diseases classified elsewhere
#plot number of cases by demographics#re-level ageentero_patients_ed <- entero_patients_ed |>mutate(age.simplified =cut(AGE, breaks =c(0, 5, 10, 15, 20, seq(30, 100, by =10)), right =FALSE, labels =c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))#count(entero_patients_ed, age.simplified)#count(entero_patients_ed, SEX)#count(entero_patients_ed, DISCHARGE_MONTH)#count(entero_patients_ed, DISCHARGE_STATUS)
#visualizationggplot(entero_patients_ed, aes(x = SEX)) +geom_bar() +labs(x ="Sex", y ="Count") +ggtitle("Enterococcus infections by sex") +theme_bw()
ggplot(entero_patients_ed, aes(x = age.simplified)) +geom_bar() +labs(x ="Age", y ="Count") +ggtitle("Enterococcus infections by age") +theme_bw()
ggplot(entero_patients_ed, aes(x = DISCHARGE_MONTH)) +geom_bar() +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("Enterococcus infections by discharge month") +theme_bw()
#infections from catheterinfcatheter_patients_ed <-filter(alldiag_2020_ed, DX %in%c("T80211A", "T80211D ", "T80211S", "T80212A", "T80212D", "T80212S", "T80218A", "T80218D", "T80218S", "T80219A", "T80219D", "T80219S"))#T80211A Bloodstream infection due to central venous catheter, initial encounter#T80211D Bloodstream infection due to central venous catheter, subsequent encounter#T80211S Bloodstream infection due to central venous catheter, sequela#T80212A Local infection due to central venous catheter, initial encounter#T80212D Local infection due to central venous catheter, subsequent encounter#T80212S Local infection due to central venous catheter, sequela#T80218A Other infection due to central venous catheter, initial encounter#T80218D Other infection due to central venous catheter, subsequent encounter#T80218S Other infection due to central venous catheter, sequela#T80219A Unspecified infection due to central venous catheter, initial encounter#T80219D Unspecified infection due to central venous catheter, subsequent encounter#T80219S Unspecified infection due to central venous catheter, sequela# no catheter-associated infections in dataset
# surgical site infectionssurginf_patients_ed <-filter(alldiag_2020_ed, DX %in%c("T8141XA", "T8141XD", "T8141XS", "T8142XA", "T8142XD", "T8142XS", "T8143XA", "T8143XD", "T8143XS", "O8600", "O8601", "O8602", "O8603", "O8604", "08609"))#T8141XA Infection following a procedure, superficial incisional surgical site, initial encounter#T8141XD Infection following a procedure, superficial incisional surgical site, subsequent encounter#T8141XS Infection following a procedure, superficial incisional surgical site, sequela#T8142XA Infection following a procedure, deep incisional surgical site, initial encounter#T8142XD Infection following a procedure, deep incisional surgical site, subsequent encounter#T8142XS Infection following a procedure, deep incisional surgical site, sequela#T8143XA Infection following a procedure, organ and space surgical site, initial encounter#T8143XD Infection following a procedure, organ and space surgical site, subsequent encounter#T8143XS Infection following a procedure, organ and space surgical site, sequela#O8600 Infection of obstetric surgical wound, unspecified#O8601 Infection of obstetric surgical wound, superficial incisional site#O8602 Infection of obstetric surgical wound, deep incisional site#O8603 Infection of obstetric surgical wound, organ and space site#O8604 Sepsis following an obstetrical procedure#O8609 Infection of obstetric surgical wound, other surgical site# no cases in dataset
I plotted each hospital acquired infection by month to visualize seasonality.
#plot month data together covid_month_ed <-ggplot(covid_patients_ed, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")#adjust margins for cowplotcovid_month_ed <- covid_month_ed +theme(plot.margin =margin(t =20, r =10, b =10, l =10))bactpneumo_month_ed <-ggplot(bactpneumo_patients_ed, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")#adjust margins for cowplotbactpneumo_month_ed <- bactpneumo_month_ed +theme(plot.margin =margin(t =20, r =10, b =10, l =10))cdiff_month_ed <-ggplot(Cdiff_patients_ed, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")#adjust margins for cowplotcdiff_month_ed <- cdiff_month_ed +theme(plot.margin =margin(t =20, r =10, b =10, l =10))entero_month_ed <-ggplot(entero_patients_ed, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +theme_bw() +theme(legend.position ="none")entero_month_ed <- entero_month_ed +theme(plot.margin =margin(t =20, r =10, b =10, l =10))plot_grid(covid_month_ed, bactpneumo_month_ed, cdiff_month_ed, entero_month_ed, labels =c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size =12, label_x =-0.05, label_y =1)
# similar trend between inpatient and emergency room (dipped in the summer, surged in the winter)plot_grid(bactpneumo_month, bactpneumo_month_ed, labels =c('Bact pneumonia (inpatient)', 'Bact pneumonia (ER)'), label_size =12, label_x =-0.05, label_y =1)
# similar trend to covid; no difference between inpatient and ERplot_grid(cdiff_month, cdiff_month_ed, labels =c('C.diff (inpatient)', 'C.diff (emergency room)'), label_size =12, label_x =-0.05, label_y =1)
# more seasonality in the emergency room data for C. diffplot_grid(entero_month, entero_month_ed, labels =c('Enterococcus (inpatient)', 'Enterococcus (ER)'), label_size =12, label_x =-0.05, label_y =1)
#ER data seamingly random (surge in June); doesn't seem to coincide with COVID case
Overall, the inpatient data and emergency room data follows a similar trend by month. There may be more seasonality in the C. difficile emergency room data however, with a drop in cases between April and August.
4 Results
I chose to include the inpatient data only, because >80% of emergency room visitors were sent home while only 16% of inpatient cases were sent home the same day. Length of stay in the hospital is a major factor in acquiring hospital-associated infection. Overall the month-to-month trends for COVID-19 and hospital infections between inpatient and emergency room visits appeared similar.
prop_ip_routine
sum(count)
1 0.1610849
# 16% of inpatient visits stayed only 1 day
prop_ed_routine
sum(count)
1 0.8077674
# 81% routine to home from emergency deparment
All 4 infections (COVID-19, bacterial pneumonia, C. difficile, Enterococcus) predominantly affected older populations. For all 4, the highest number of cases was between the ages of 60-69.
I also found that there is a negative association between female sex and being diagnosed for COVID-19, bacterial pneumonia, and Enterococcus infection. This is consistent with published reports. COVID-19 infection was more prevalent in men than women in the early phases of the pandemic (Abate et al. 2020) and the severity and outcomes from COVID-19 was worse in men (Pradhan and Olsson. 2020). Men also have been reported to develop pneumonia more frequently than women (Gay et al. 2021). Studies have also shown that bloodstreamn infections with vancomycin-resistant enterococci (VRE) has a male preference (Correa-Martinez et al. 2021).
The incidence of C. difficile meanwhile was not significantly associated with a particular sex. While published reports are mixed, some studies have shown that C. difficile infection encounters are more often female (Young et al. 2022), which is in line with the slightly higher case count in this data set. Together, this analysis shows that the sex-bias in bacterial pneumonia, Enterococcus, and C. difficile infection was not affected by the COVID-19 pandemic in 2020.
As expected, there was a correlation between discharge month (which I used as a proxy for diagnosis month since that wasn’t available) and COVID-19 diagnosis. Relative to April, COVID-19 was negatively associated with the later months. It also appears that there is a seasonality to COVID-19 cases, where cases dropped in the summer months and surged in the winter.
ggplot(covid_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_manual(values = colors_set3) +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("COVID-19 cases by discharge month") +theme_bw() +theme(legend.position ="none")
summary(covid.month.fit)
Call:
glm(formula = COVID ~ DISCHARGE_MONTH, family = binomial, data = covid_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.33340 0.02855 -151.786 < 2e-16 ***
DISCHARGE_MONTHJanuary -7.50209 1.00041 -7.499 6.43e-14 ***
DISCHARGE_MONTHFebruary -16.23267 49.74714 -0.326 0.7442
DISCHARGE_MONTHMarch -5.76748 0.44813 -12.870 < 2e-16 ***
DISCHARGE_MONTHMay -0.60721 0.04544 -13.363 < 2e-16 ***
DISCHARGE_MONTHJune -1.31565 0.05666 -23.221 < 2e-16 ***
DISCHARGE_MONTHJuly -1.35050 0.05550 -24.334 < 2e-16 ***
DISCHARGE_MONTHAugust -1.54094 0.05988 -25.735 < 2e-16 ***
DISCHARGE_MONTHSeptember -1.90079 0.06933 -27.416 < 2e-16 ***
DISCHARGE_MONTHOctober -1.36564 0.05500 -24.829 < 2e-16 ***
DISCHARGE_MONTHNovember -0.58515 0.04379 -13.364 < 2e-16 ***
DISCHARGE_MONTHDecember -0.09776 0.03829 -2.553 0.0107 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83171 on 1498051 degrees of freedom
Residual deviance: 76909 on 1498040 degrees of freedom
AIC: 76933
Number of Fisher Scoring iterations: 19
Bacterial pneumonia had a similar seasonality as COVID-19, dropping from June to October and surging in the colder months. Analysis of the association of COVID-19 infection and bacterial pneumonia diagnosis reveals a significant positive association, indicating that COVID-19 infection increases the risk of bacterial pneumonia. Secondary bacterial infection is a important contributor to mortality following viral infection, including COVID-19 (Morens et al. 2008, Gao et al. 2023). Interestingly, the incidence of bacterial pneumonia by age changed throughout 2020, where children under the age of 10 that made up nearly 20% of cases in January significantly decreased, likely due to changes in exposure from the COVID-19 pandemic.
ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("Bacterial penumonia cases by discharge month") +theme_bw() +theme(legend.position ="none")
summary(bactpneumo.month.fit)
Call:
glm(formula = bactpneumo ~ factor(DISCHARGE_MONTH), family = binomial,
data = bactpneumo_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.61100 0.07334 -103.774 < 2e-16 ***
factor(DISCHARGE_MONTH)2 -0.29591 0.11460 -2.582 0.009821 **
factor(DISCHARGE_MONTH)3 0.12533 0.10358 1.210 0.226310
factor(DISCHARGE_MONTH)4 0.27074 0.10598 2.555 0.010626 *
factor(DISCHARGE_MONTH)5 0.08101 0.10749 0.754 0.451060
factor(DISCHARGE_MONTH)6 -0.36646 0.12065 -3.037 0.002387 **
factor(DISCHARGE_MONTH)7 -0.51773 0.12323 -4.201 2.65e-05 ***
factor(DISCHARGE_MONTH)8 -0.41547 0.11996 -3.463 0.000534 ***
factor(DISCHARGE_MONTH)9 -0.70243 0.13258 -5.298 1.17e-07 ***
factor(DISCHARGE_MONTH)10 -0.39638 0.11741 -3.376 0.000735 ***
factor(DISCHARGE_MONTH)11 -0.15140 0.11238 -1.347 0.177909
factor(DISCHARGE_MONTH)12 0.05973 0.10518 0.568 0.570118
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 29344 on 3980819 degrees of freedom
Residual deviance: 29215 on 3980808 degrees of freedom
AIC: 29239
Number of Fisher Scoring iterations: 11
# Create the plot with labeled factorsggplot(alldiag_bactpneumo_factor, aes(x = bactpneumo, fill = COVID)) +geom_bar(position ="fill") +scale_fill_manual(values =c("Negative"="#aaaaaa", "Positive"="#b20000")) +theme_bw() +labs(x ="Bacterial Pneumonia", y ="Proportion", fill ="COVID Status")
summary(covid.bactpneumo.fit)
Call:
glm(formula = bactpneumo ~ COVID, family = binomial, data = alldiag_bactpneumo_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.2344 0.1049 -68.99 < 2e-16 ***
COVID 1.6850 0.2262 7.45 9.33e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1865.7 on 132693 degrees of freedom
Residual deviance: 1826.3 on 132692 degrees of freedom
AIC: 1830.3
Number of Fisher Scoring iterations: 10
ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +geom_bar(position ="fill") +scale_fill_brewer(palette ="Spectral") +scale_x_continuous(breaks =1:12) +labs(x ="Discharge Month", y ="Proportion", fill ="Age") +ggtitle("Bacterial pneumonia cases") +theme_bw()
bactpneumo_under10_plot
I next performed a similar analysis for two bacterial pathogens commonly associated with hospital-associated infections outside the lungs. There was no clear seasonality with C. difficile infection; cases remained relatively stable throughout the year. Additionally, COVID-19 infection was not associated with C. difficile infection. A study that performed a single-center comparison in hospital-associated C. difficile infection between 2019 (pre-pandemic) and 2020 (pandemic) also found the incidence remained unchanged; however, their study found that the incidence of hospital-associated C. difficile infection was higher in patients with COVID-19 than in non-COVID-19 cases which they attributed to longer hospital stay and increased use of broad-spectrum antibiotics (Vazquez-Cuesta et al. 2022). The discrepancies between their study and my analysis may be reflective of the limitations of a single-center analysis. Several studies reported a lower C. difficile infection rate compared to pre-pandemic period (Granata et al. 2022).
ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("C. difficile infection by discharge month") +theme_bw() +theme(legend.position ="none")
summary(cdiff.month.fit)
Call:
glm(formula = cdiff ~ factor(DISCHARGE_MONTH), family = binomial,
data = cdiff_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.161890 0.096711 -74.054 <2e-16 ***
factor(DISCHARGE_MONTH)2 -0.182908 0.146820 -1.246 0.2128
factor(DISCHARGE_MONTH)3 -0.015104 0.141824 -0.106 0.9152
factor(DISCHARGE_MONTH)4 -0.150278 0.158077 -0.951 0.3418
factor(DISCHARGE_MONTH)5 -0.228811 0.153778 -1.488 0.1368
factor(DISCHARGE_MONTH)6 -0.097025 0.145829 -0.665 0.5058
factor(DISCHARGE_MONTH)7 -0.287661 0.150062 -1.917 0.0552 .
factor(DISCHARGE_MONTH)8 -0.315877 0.151853 -2.080 0.0375 *
factor(DISCHARGE_MONTH)9 -0.003822 0.139506 -0.027 0.9781
factor(DISCHARGE_MONTH)10 -0.263541 0.147330 -1.789 0.0736 .
factor(DISCHARGE_MONTH)11 -0.150290 0.145828 -1.031 0.3027
factor(DISCHARGE_MONTH)12 -0.259636 0.148387 -1.750 0.0802 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16506 on 1498051 degrees of freedom
Residual deviance: 16494 on 1498040 degrees of freedom
(2482768 observations deleted due to missingness)
AIC: 16518
Number of Fisher Scoring iterations: 10
# Create the plot with labeled factorsggplot(alldiag_cdiff_factor, aes(x = Cdiff, fill = COVID)) +geom_bar(position ="fill") +scale_fill_manual(values =c("Negative"="#aaaaaa", "Positive"="#b20000")) +theme_bw() +labs(x ="C. difficile infection", y ="Proportion", fill ="COVID Status")
summary(covid.cdiff.fit)
Call:
glm(formula = Cdiff ~ COVID, family = binomial, data = alldiag_cdiff_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.89046 0.03270 -149.548 <2e-16 ***
COVID 0.03812 0.14568 0.262 0.794
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11690 on 132693 degrees of freedom
Residual deviance: 11690 on 132692 degrees of freedom
AIC: 11694
Number of Fisher Scoring iterations: 7
Similarly, the incidence of Enterococcus infection did not have a clear seasonality and there was a lack of association between COVID-19 diagnosis and Enterococcus co-infection.
ggplot(entero_patients, aes(x = DISCHARGE_MONTH, fill =factor(DISCHARGE_MONTH))) +geom_bar() +scale_fill_brewer(palette ="Set3") +scale_x_continuous(breaks =1:12) +labs(x ="Month", y ="Count") +ggtitle("Enterococcus infections by discharge month") +theme_bw() +theme(legend.position ="none")
summary(entero.month.fit)
Call:
glm(formula = entero ~ factor(DISCHARGE_MONTH), family = binomial,
data = entero_binary_month)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.88586 0.13868 -64.072 <2e-16 ***
factor(DISCHARGE_MONTH)2 -0.16754 0.20887 -0.802 0.4225
factor(DISCHARGE_MONTH)3 -0.04716 0.20485 -0.230 0.8179
factor(DISCHARGE_MONTH)4 -0.07022 0.22057 -0.318 0.7502
factor(DISCHARGE_MONTH)5 0.19973 0.19709 1.013 0.3109
factor(DISCHARGE_MONTH)6 -0.11961 0.21184 -0.565 0.5723
factor(DISCHARGE_MONTH)7 0.15738 0.19260 0.817 0.4139
factor(DISCHARGE_MONTH)8 -0.41638 0.22692 -1.835 0.0665 .
factor(DISCHARGE_MONTH)9 0.01575 0.20128 0.078 0.9376
factor(DISCHARGE_MONTH)10 -0.02964 0.20017 -0.148 0.8823
factor(DISCHARGE_MONTH)11 0.02457 0.20242 0.121 0.9034
factor(DISCHARGE_MONTH)12 0.13403 0.19520 0.687 0.4923
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10733 on 3980819 degrees of freedom
Residual deviance: 10721 on 3980808 degrees of freedom
AIC: 10745
Number of Fisher Scoring iterations: 12
# Create the plot with labeled factorsggplot(alldiag_entero_factor, aes(x = Entero, fill = COVID)) +geom_bar(position ="fill") +scale_fill_manual(values =c("Negative"="#aaaaaa", "Positive"="#b20000")) +theme_bw() +labs(x ="Enterococcus infection", y ="Proportion", fill ="COVID Status")
summary(covid.entero.fit)
Call:
glm(formula = Entero ~ COVID, family = binomial, data = alldiag_entero_binary)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.21056 0.06293 -98.686 <2e-16 ***
COVID 0.21349 0.25810 0.827 0.408
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3873.6 on 132693 degrees of freedom
Residual deviance: 3873.0 on 132692 degrees of freedom
AIC: 3877
Number of Fisher Scoring iterations: 9
5 Conclusion
Overall my findings indicate that the COVID-19 pandemic did not significantly influence the incidence of two common hospital-acquired infections (C. difficile and Enterococcus) in the early phase of the pandemic. As expected, bacterial pneumonia which is commonly associated with respiratory viral infections had a similar seasonality as COVID-19 and was significantly associated with COVID-19 infection.
This study was limited by the data set available. Ideally, I would’ve liked to compare the results from 2020 to pre-pandemic levels (2019) and post-pandemic data (2022-). However, these data sets were restricted and not publicly available. Another limitation of this data set was the lack of incidence of other common hospital-acquired infections (e.g. central-line bloodstream infections, catheter-associated UTIs, and MRSA). For example, a study by the CDC found that the incidence of central-line bloodstream infections, catheter-associated UTIs, ventilator-associated infections, and MRSA bacteremia was significantly higher in 2021 than pre-pandemic levels (Lastinger et al. 2022). It would have also been nice to have geographical data as well, since certain regions were more heavily affected by the pandemic than others.
Young et al. (2022) Clostridioides difficile Infection Treatment and Outcome Disparities in a National Sample of United States Hospitals. Antibiotics.https://www.mdpi.com/2079-6382/11/9/1203
Morens et al. (2008) Predominant role of bacterial pneumonia as a cause of death in pandemic influenza: implications for pandemic influenza preparedness. J Infect Dis.
Gao et al. (2023) Machine learning links unresolving secondary pneumonia to mortality in patients with severe pneumonia, including COVID-19. JCI.https://www.jci.org/articles/view/170682#B17